Introduction
略
Matrix Multiplication
利用矩阵乘法介绍优化方案
采用 Python、Java、C 的运行时间不一样
Why is Python so slow and C so fast?
- Python is interpreted.
- C is compiled directly to machine code.
- Java is compiled to byte-code, which is then interpreted and just-in-time (JIT) compiled to machine code.
i、j、k 循环调换位置后运行时间不一样
- cache hits 和 cache misses
Clang 优化
不一定 O3 优化比 O2 优化快,有时候 O2 快,有时候 O3 快
Parallel Loops
The cilk_for loop allows all iterations of the loop to execute in parallel.
Rule of Thumb Parallelize outer loops rather than inner loops.
进一步优化-重用数据(tiling)
- Restructure the computation to reuse data in the cache as much as possible. (Cache misses are slow, and cache hits are fast.)
- Try to make the most of the cache by reusing the data that’s already there.
Parallel divide-and-conquer
cilk_spawn
/*
The child function call is spawned, meaning it may execute in parallel with the parent caller.
*/
cilk_sync
/*
Control may not pass this point until all spawned children have returned.
*/
Compiler Vectorization(编译矢量化)
Many machines don’t support the newest set of vector instructions, however, so the compiler uses vector instructions conservatively by default
更多方法
- Preprocessing
- Matrix transposition
- Data alignment
- Memory-management optimizations
- A clever algorithm for the base case that uses AVX intrinsic instructions explicitly
综上优化效果